Overview

Dataset statistics

Number of variables             3
Number of observations          125086
Missing cells                   0
Missing cells (%)               0.0%
Duplicate rows                  0
Duplicate rows (%)              0.0%
Total size in memory            2.9 MiB
Average record size in memory   24.0 B
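The memory figures can be cross-checked with a little arithmetic. The sketch below assumes 8 bytes per stored value and a fixed 128-byte index; neither number is stated in the report, they are simply values that happen to reproduce the reported totals, including the 977.4 KiB memory size listed for each variable.

```python
# Figures copied from the dataset statistics.
n_rows = 125_086
n_cols = 3

# Assumptions (not stated in the report): every column stores 8 bytes per row
# (an object pointer for tconst, a 64-bit number for the other two columns),
# and the index adds a fixed 128 bytes.
BYTES_PER_VALUE = 8
INDEX_BYTES = 128

column_bytes = n_rows * BYTES_PER_VALUE + INDEX_BYTES
total_bytes = n_rows * n_cols * BYTES_PER_VALUE + INDEX_BYTES

per_column_kib = column_bytes / 1024
total_mib = total_bytes / 1024**2
avg_record_bytes = total_bytes / n_rows

print(f"{per_column_kib:.1f} KiB per column")
print(f"{total_mib:.1f} MiB in total")
print(f"{avg_record_bytes:.1f} B per record")
```

Under these assumptions the three printed values round to the figures shown in the report.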

Variable types

Categorical   1
Numeric       2

Alerts

tconst has a high cardinality: 125086 distinct values   [High cardinality]
numVotes is highly skewed (γ1 = 47.57718569)            [Skewed]
tconst is uniformly distributed                         [Uniform]
tconst has unique values                                [Unique]
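The skewness alert refers to γ1, the standardized third moment. The report does not say which estimator it uses, so the following is only a minimal sketch of the plain moment-based form (no bias correction), run on hypothetical values rather than the real numVotes column.

```python
def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5, without bias correction."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical vote counts: mostly small values plus one very large one,
# which produces a long right tail and a large positive skewness.
votes = [5, 6, 7, 8, 9, 12, 26, 103, 2_000_000]

# A symmetric sample has zero skewness.
symmetric = [1, 2, 3, 4, 5]

print(skewness(votes))
print(skewness(symmetric))
```

A single extreme value is enough to dominate the third moment, which is why a column with a median of 26 and a maximum above two million is flagged as highly skewed.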

Reproduction

Analysis started         2022-11-23 20:36:18.208470
Analysis finished        2022-11-23 20:39:10.835837
Duration                 2 minutes and 52.63 seconds
Software version         pandas-profiling vdev
Download configuration   config.json

Variables

tconst
Categorical

HIGH CARDINALITY
UNIFORM
UNIQUE

Distinct       125086
Distinct (%)   100.0%
Missing        0
Missing (%)    0.0%
Memory size    977.4 KiB

Value                   Count
tt0000001               1
tt2177418               1
tt2177318               1
tt2177167               1
tt2177145               1
Other values (125081)   125081

Length

Max length      10
Median length   9
Mean length     9.1504565
Min length      9

Characters and Unicode

Total characters      1144594
Distinct characters   11
Distinct categories   2
Distinct scripts      2
Distinct blocks       1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
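The length and character totals for tconst are mutually consistent, and they also reveal how many identifiers have each length. The sketch below uses only figures from this report; the one assumption is that every identifier starts with the two letters "tt", as all the listed values do.

```python
# Figures copied from the report.
n_ids = 125_086
total_chars = 1_144_594
min_len, max_len = 9, 10

mean_len = total_chars / n_ids

# Lengths can only be 9 or 10, so every character beyond 9 per identifier
# belongs to exactly one 10-character identifier.
n_long = total_chars - n_ids * min_len
n_short = n_ids - n_long

# Assumption: each identifier begins with "tt", giving two letters per row.
letters = 2 * n_ids
digits = total_chars - letters

print(f"mean length: {mean_len:.7f}")
print(f"{n_short} identifiers of length {min_len}, {n_long} of length {max_len}")
print(f"{letters} letters, {digits} digits")
```

The letter and digit counts match the Lowercase Letter and Decimal Number totals in the character tables.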

Unique

Unique       125086
Unique (%)   100.0%

Sample

1st row   tt0000001
2nd row   tt0000007
3rd row   tt0000009
4th row   tt0000031
5th row   tt0000061

Common Values

Value                   Count    Frequency (%)
tt0000001               1        < 0.1%
tt2177418               1        < 0.1%
tt2177318               1        < 0.1%
tt2177167               1        < 0.1%
tt2177145               1        < 0.1%
tt2177125               1        < 0.1%
tt2177115               1        < 0.1%
tt2177113               1        < 0.1%
tt2177007               1        < 0.1%
tt2176948               1        < 0.1%
Other values (125076)   125076   > 99.9%

Length

Histogram of lengths of the category

Value                   Count    Frequency (%)
tt0000001               1        < 0.1%
tt0000031               1        < 0.1%
tt0000085               1        < 0.1%
tt0000089               1        < 0.1%
tt0000096               1        < 0.1%
tt0000102               1        < 0.1%
tt0000107               1        < 0.1%
tt0000119               1        < 0.1%
tt0000132               1        < 0.1%
tt0000135               1        < 0.1%
Other values (125076)   125076   > 99.9%

Most occurring characters

Value   Count    Frequency (%)
t       250172   21.9%
0       132465   11.6%
1       111401   9.7%
2       95034    8.3%
4       88852    7.8%
6       86315    7.5%
8       82800    7.2%
3       79201    6.9%
5       76556    6.7%
7       72701    6.4%

Most occurring categories

Value              Count    Frequency (%)
Decimal Number     894422   78.1%
Lowercase Letter   250172   21.9%

Most frequent character per category

Decimal Number
Value   Count    Frequency (%)
0       132465   14.8%
1       111401   12.5%
2       95034    10.6%
4       88852    9.9%
6       86315    9.7%
8       82800    9.3%
3       79201    8.9%
5       76556    8.6%
7       72701    8.1%
9       69097    7.7%

Lowercase Letter
Value   Count    Frequency (%)
t       250172   100.0%

Most occurring scripts

Value    Count    Frequency (%)
Common   894422   78.1%
Latin    250172   21.9%

Most frequent character per script

Common
Value   Count    Frequency (%)
0       132465   14.8%
1       111401   12.5%
2       95034    10.6%
4       88852    9.9%
6       86315    9.7%
8       82800    9.3%
3       79201    8.9%
5       76556    8.6%
7       72701    8.1%
9       69097    7.7%

Latin
Value   Count    Frequency (%)
t       250172   100.0%

Most occurring blocks

Value   Count     Frequency (%)
ASCII   1144594   100.0%

Most frequent character per block

ASCII
Value   Count    Frequency (%)
t       250172   21.9%
0       132465   11.6%
1       111401   9.7%
2       95034    8.3%
4       88852    7.8%
6       86315    7.5%
8       82800    7.2%
3       79201    6.9%
5       76556    6.7%
7       72701    6.4%

averageRating
Real number (ℝ)

Distinct       91
Distinct (%)   0.1%
Missing        0
Missing (%)    0.0%
Infinite       0
Infinite (%)   0.0%
Mean           6.9495467
Minimum        1
Maximum        10
Zeros          0
Zeros (%)      0.0%
Negative       0
Negative (%)   0.0%
Memory size    977.4 KiB

Quantile statistics

Minimum                     1
5-th percentile             4.4
Q1                          6.2
median                      7.1
Q3                          7.9
95-th percentile            8.9
Maximum                     10
Range                       9
Interquartile range (IQR)   1.7

Descriptive statistics

Standard deviation                1.3900248
Coefficient of variation (CV)     0.20001662
Kurtosis                          1.114094
Mean                              6.9495467
Median Absolute Deviation (MAD)   0.8
Skewness                          -0.79938617
Sum                               869291
Variance                          1.9321691
Monotonicity                      Not monotonic
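Several of the averageRating statistics are derived from one another, which gives a quick way to sanity-check the report. The following sketch recomputes four of them from the reported mean, standard deviation, quartiles and row count; the inputs are rounded as printed in the report, so agreement is only expected up to rounding.

```python
# Figures copied from the averageRating statistics.
n = 125_086
mean, std = 6.9495467, 1.3900248
q1, q3 = 6.2, 7.9

cv = std / mean        # coefficient of variation
variance = std ** 2    # variance is the squared standard deviation
iqr = q3 - q1          # interquartile range
total = mean * n       # sum of all ratings

print(f"CV       {cv:.8f}")
print(f"Variance {variance:.7f}")
print(f"IQR      {iqr:.1f}")
print(f"Sum      {total:.0f}")
```

The same relations hold for numVotes, where the much larger coefficient of variation (15.5) reflects the heavy right tail.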
Histogram with fixed size bins (bins=50)

Common Values

Value               Count   Frequency (%)
7.2                 4627    3.7%
7.6                 4470    3.6%
7.8                 4380    3.5%
7.4                 4379    3.5%
7                   4321    3.5%
7.5                 4107    3.3%
8                   4043    3.2%
6.8                 3956    3.2%
7.3                 3941    3.2%
7.7                 3801    3.0%
Other values (81)   83061   66.4%

Minimum 10 values

Value   Count   Frequency (%)
1       109     0.1%
1.1     18      < 0.1%
1.2     45      < 0.1%
1.3     32      < 0.1%
1.4     31      < 0.1%
1.5     44      < 0.1%
1.6     49      < 0.1%
1.7     45      < 0.1%
1.8     45      < 0.1%
1.9     48      < 0.1%

Maximum 10 values

Value   Count   Frequency (%)
10      566     0.5%
9.9     142     0.1%
9.8     293     0.2%
9.7     232     0.2%
9.6     408     0.3%
9.5     315     0.3%
9.4     537     0.4%
9.3     517     0.4%
9.2     927     0.7%
9.1     715     0.6%

numVotes
Real number (ℝ)

Distinct       5737
Distinct (%)   4.6%
Missing        0
Missing (%)    0.0%
Infinite       0
Infinite (%)   0.0%
Mean           1071.9301
Minimum        5
Maximum        2038650
Zeros          0
Zeros (%)      0.0%
Negative       0
Negative (%)   0.0%
Memory size    977.4 KiB

Quantile statistics

Minimum                     5
5-th percentile             6
Q1                          12
median                      26
Q3                          103
95-th percentile            1375.75
Maximum                     2038650
Range                       2038645
Interquartile range (IQR)   91

Descriptive statistics

Standard deviation                16649.255
Coefficient of variation (CV)     15.532034
Kurtosis                          3508.4993
Mean                              1071.9301
Median Absolute Deviation (MAD)   18
Skewness                          47.577186
Sum                               1.3408345 × 10^8
Variance                          2.7719768 × 10^8
Monotonicity                      Not monotonic
Histogram with fixed size bins (bins=50)

Common Values

Value                 Count   Frequency (%)
7                     5165    4.1%
8                     5056    4.0%
6                     4818    3.9%
9                     4786    3.8%
10                    4289    3.4%
11                    3954    3.2%
12                    3622    2.9%
13                    3253    2.6%
14                    3094    2.5%
5                     3017    2.4%
Other values (5727)   84032   67.2%

Minimum 10 values

Value   Count   Frequency (%)
5       3017    2.4%
6       4818    3.9%
7       5165    4.1%
8       5056    4.0%
9       4786    3.8%
10      4289    3.4%
11      3954    3.2%
12      3622    2.9%
13      3253    2.6%
14      3094    2.5%

Maximum 10 values

Value     Count   Frequency (%)
2038650   1       < 0.1%
1397275   1       < 0.1%
1233535   1       < 0.1%
1173251   1       < 0.1%
1111258   1       < 0.1%
1035355   1       < 0.1%
852313    1       < 0.1%
815591    1       < 0.1%
815589    1       < 0.1%
776363    1       < 0.1%

Interactions

Correlations

Auto

The auto setting picks an interpretable pairwise metric for each pair of columns, based on their variable types:
  • Variable_type-Variable_type : Method, Range
  • Categorical-Categorical : Cramér's V, [0,1]
  • Numerical-Categorical : Cramér's V, [0,1] (using a discretized numerical column)
  • Numerical-Numerical : Spearman's ρ, [-1,1]
The number of bins used to discretize the numerical column in a Numerical-Categorical pair can be changed using config.correlations["auto"].n_bins. The number of bins determines the granularity of the association being measured.

This configuration uses the recommended metric for each pair of columns.
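For the two numeric columns in this dataset, the auto setting therefore reports Spearman's ρ, which is the Pearson correlation of the ranks. Below is a minimal sketch in plain Python, not the library's own implementation; tied values receive the average of their ranks. The input values are the first five rows of the sample table at the end of this report, far too few to say anything about the full dataset.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# First five sample rows of the report (averageRating, numVotes).
ratings = [5.7, 5.4, 5.3, 5.5, 3.9]
votes = [1924, 798, 200, 982, 24]
print(spearman(ratings, votes))
```

Because only the ordering of the values matters, the extreme skew of numVotes does not distort the coefficient the way it would distort a Pearson correlation on the raw values.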

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available for Phik.

Missing values

A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.
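As a rough illustration of what such a matrix encodes, the sketch below builds one by hand and renders it as text. This dataset has no missing cells, so the two gaps are hypothetical and were inserted only to make a pattern visible; the helper names are likewise made up for this example.

```python
# Hypothetical rows: the values come from the sample table, but the two None
# entries are invented here, since the real dataset has no missing cells.
rows = [
    {"tconst": "tt0000001", "averageRating": 5.7, "numVotes": 1924},
    {"tconst": "tt0000007", "averageRating": None, "numVotes": 798},
    {"tconst": "tt0000009", "averageRating": 5.3, "numVotes": None},
    {"tconst": "tt0000031", "averageRating": 5.5, "numVotes": 982},
]
columns = ["tconst", "averageRating", "numVotes"]

def nullity_matrix(rows, columns):
    """Return one list of booleans per row: True where a value is present."""
    return [[row.get(col) is not None for col in columns] for row in rows]

def render(matrix):
    """Draw present cells as '#' and missing cells as '.'."""
    return "\n".join("".join("#" if cell else "." for cell in line) for line in matrix)

matrix = nullity_matrix(rows, columns)
missing_cells = sum(not cell for line in matrix for cell in line)

print(render(matrix))
print("missing cells:", missing_cells)
```

For the real data every cell would be drawn as present, which matches the 0 missing cells reported in the overview.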

Sample

First rows

         tconst      averageRating   numVotes
0        tt0000001   5.7             1924
1        tt0000007   5.4             798
2        tt0000009   5.3             200
3        tt0000031   5.5             982
4        tt0000061   3.9             24
5        tt0000085   4.7             33
6        tt0000089   6.2             982
7        tt0000096   4.3             29
8        tt0000102   4.6             26
9        tt0000107   5.2             28

Last rows

         tconst      averageRating   numVotes
125076   tt9915156   6.4             48
125077   tt9915646   8.8             22
125078   tt9915754   8.9             7
125079   tt9915780   6.5             48
125080   tt9915824   7.9             44
125081   tt9915864   7.6             105
125082   tt9915912   7.7             318
125083   tt9916160   6.4             49
125084   tt9916270   5.8             1376
125085   tt9916778   7.3             35